Introduction

We found the Forest Covertype dataset in the UCI Machine Learning Repository, which contains forestry data from the Roosevelt National Forest in northern Colorado. The observations are taken from 30m by 30m patches of forest, each classified as one of seven forest cover types:

  1. Spruce/Fir
  2. Lodgepole Pine
  3. Ponderosa Pine
  4. Cottonwood/Willow
  5. Aspen
  6. Douglas-fir
  7. Krummholz

The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Kaggle hosted the dataset in a competition with a training set of 15,120 observations and a test set of 565,892 observations. The relative sizes of the training and test sets make classification of cover type a challenging problem. We decided to use the machine learning and visualization packages available in R for this project.

Data Exploration

Name | Measurement | Description
---- | ----------- | -----------
Elevation | meters | Elevation in meters
Aspect | degrees azimuth | Aspect in degrees azimuth
Slope | degrees | Slope in degrees
Horizontal Distance To Hydrology | meters | Horizontal distance to nearest surface water features
Vertical Distance To Hydrology | meters | Vertical distance to nearest surface water features
Horizontal Distance To Roadways | meters | Horizontal distance to nearest roadway
Hillshade 9am | 0 to 255 index | Hillshade index at 9am, summer solstice
Hillshade Noon | 0 to 255 index | Hillshade index at noon, summer solstice
Hillshade 3pm | 0 to 255 index | Hillshade index at 3pm, summer solstice
Horizontal Distance To Fire Points | meters | Horizontal distance to nearest wildfire ignition points
Wilderness Area (4 binary columns) | 0 (absence) or 1 (presence) | Wilderness area designation
Soil Type (40 binary columns) | 0 (absence) or 1 (presence) | Soil type designation
Cover Type | classes 1 to 7 | Forest cover type designation (response variable)

Some class separation is clearly visible in the following plots of elevation.

Another compelling variable is aspect, or the cardinal direction that the slope has the steepest gradient downwards. For example, in the rose diagram below, there are more Douglas-Fir trees for observations with northern aspects (near 0º) than southern aspects (near 180º).

Across wilderness areas, there also seems to be some class separation. Cottonwood/Willow covers are found only in the Cache la Poudre area, and the Neota area comprises only Spruce/Fir, Krummholz, and Lodgepole Pine covers.

One of the clearer relationships in the dataset was the distribution of Hillshade luminance. Hillshade is measured on a scale from 0 to 255 (dark to bright).

Hillshade at time t varies as a factor of: \[\cos(slope)\cos(90 - Altitude) + \sin(slope)\sin(90 - Altitude)\cos(Azimuth - Aspect)\] where Altitude is the angle of the Sun above the horizon and Azimuth is the compass direction of the Sun; an azimuth of 90 degrees corresponds to East.

This equation arises from a theorem in spherical geometry known as the spherical law of cosines, which relates the sides and angles of triangles constructed on spherical surfaces.


In a unit sphere, the lengths a, b, c correspond to the angles subtended by those sides from the center of the sphere. If we know the two sides a and b and the angle C between them, then the cosine of c is given by: \[\cos(c) = \cos(a)\cos(b) + \sin(a)\sin(b)\cos(C)\] In the hillshade formula, a is the slope, b is the Sun's zenith angle (90 - Altitude), and C is the difference between the Sun's azimuth and the aspect.

In short, the illumination of a patch is determined by the altitude of the Sun, the slope of the terrain, and the relative direction of the Sun and the slope.
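As a concrete check of the formula, here is a minimal numeric sketch (Python for illustration; the analysis itself was done in R). The scaling by 255 matches the dataset's 0-to-255 hillshade index:

```python
import math

def hillshade(slope_deg, aspect_deg, sun_altitude_deg, sun_azimuth_deg):
    """Hillshade index on the 0-255 scale used by the dataset.

    Implements 255 * [cos(slope)cos(zenith) + sin(slope)sin(zenith)cos(azimuth - aspect)],
    where zenith = 90 - altitude. Negative values (self-shadowed cells) are clipped to 0.
    """
    slope = math.radians(slope_deg)
    zenith = math.radians(90 - sun_altitude_deg)
    rel = math.radians(sun_azimuth_deg - aspect_deg)
    value = 255 * (math.cos(slope) * math.cos(zenith)
                   + math.sin(slope) * math.sin(zenith) * math.cos(rel))
    return max(0.0, value)

# A flat cell: illumination depends only on the Sun's altitude, not on aspect.
print(hillshade(0, 0, 45, 90))  # 255 * cos(45 deg)
# A south-facing 30-degree slope under a southern Sun is brighter than a north-facing one.
print(hillshade(30, 180, 45, 180), hillshade(30, 0, 45, 180))
```

Note how the aspect term drops out entirely when the slope is zero, which is why flat cells share the same hillshade regardless of orientation.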

\(Hillshade(t_1,t_2,t_3)\), the hillshade values at the three measured times, has been plotted below in 3 dimensions:

More importantly, in the context of class separation, the following plot of elevation, slope, and hillshade clearly separates the forest cover types, suggesting that elevation may be the most significant factor in determining cover type.

The primary reason for the collection of cartographic data pertains to terrain mapping and is ultimately useful for applying topographic correction to satellite images in remote sensing or as background information in scientific studies.

Topographic correction is necessary if, for example, we wish to identify materials on the Earth’s surface by deriving empirical spectral signatures, or to compare images taken at different times with different Sun and satellite positions and angles. By applying the corrections, it is possible to transform satellite-derived reflectance values into the true reflectivity or radiance under horizontal conditions.

Modeling

Logistic Regression

We first investigated multinomial logistic regression. Since classical multiple regression is sensitive to high variance of coefficient estimates when predictor variables are correlated, we decided to use the glmnet package, which fits logistic regressions with regularization. Given a predictor vector x, the probability that the response belongs to class k of 1 through 7 is the following:

\[\mbox{Pr}(G=k|X=x)=\frac{e^{\beta_{0k}+\beta_k^Tx}}{\sum_{\ell=1}^7e^{\beta_{0\ell}+\beta_\ell^Tx}}\]
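As a sanity check of this expression, a minimal sketch (in Python, though our models were fit with glmnet in R) computes the class probabilities for a hypothetical set of coefficients:

```python
import math

def multinomial_probs(x, intercepts, coefs):
    """Pr(G = k | X = x) = exp(b0k + bk'x) / sum_l exp(b0l + bl'x).

    intercepts: list of 7 values b0k; coefs: list of 7 coefficient vectors bk.
    """
    scores = [b0 + sum(b * xi for b, xi in zip(bk, x))
              for b0, bk in zip(intercepts, coefs)]
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients for 7 classes and 2 standardized predictors.
probs = multinomial_probs([1.0, -0.5],
                          intercepts=[0.1 * k for k in range(7)],
                          coefs=[[0.2 * k, -0.1 * k] for k in range(7)])
print(probs, sum(probs))  # seven probabilities summing to 1
```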

We then fit the model by minimizing the elastic-net penalized negative log-likelihood function (the negative log-likelihood is the first term of the following equation):

\[\ell(\{\beta_{0k},\beta_{k}\}_1^7) = -\left[\frac{1}{N} \sum_{i=1}^N \Big(\sum_{k=1}^7y_{ik} (\beta_{0k} + x_i^T \beta_k)- \log \big(\sum_{k=1}^7 e^{\beta_{0k}+x_i^T \beta_k}\big)\Big)\right] +\lambda \left[ (1-\alpha)||\beta||_F^2/2 + \alpha\sum_{j=1}^p||\beta_j||\right]\]

where k is the response class (cover type from 1 to 7), N is the number of observations, and \(y_{ik}\) indicates whether observation i belongs to class k. The second term in the equation is the regularization penalty.

The elastic-net mixing parameter \(\alpha\) varies from 0 to 1. When \(\alpha = 0\), the penalty reduces to ridge regularization; when \(\alpha = 1\), lasso regularization is used. The ridge penalty shrinks the coefficients of closely correlated predictors toward each other, whereas the lasso tends to pick one and discard the others.
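To make the penalty term concrete, here is a small sketch (Python for illustration; the toy coefficient matrix is hypothetical) that evaluates it for any \(\alpha\):

```python
import math

def elastic_net_penalty(beta, alpha):
    """Elastic-net penalty (1 - alpha) * ||B||_F^2 / 2 + alpha * sum_j ||beta_j||.

    beta: p x K coefficient matrix as a list of rows, one row per predictor
    across the K classes (the grouped multinomial penalty in the equation above).
    """
    frobenius_sq = sum(b * b for row in beta for b in row)
    group_norms = sum(math.sqrt(sum(b * b for b in row)) for row in beta)
    return (1 - alpha) * frobenius_sq / 2 + alpha * group_norms

beta = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]  # toy 3-predictor, 2-class coefficients
print(elastic_net_penalty(beta, 0.0))  # pure ridge: ||B||_F^2 / 2 = 13
print(elastic_net_penalty(beta, 1.0))  # pure (group) lasso: 5 + 0 + 1 = 6
```

Note that the group-lasso term rewards zeroing out a predictor's entire row of coefficients, which is how the lasso discards predictors.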

Ridge Regularization

We performed 10-fold cross-validation for ridge regularization over a grid of 100 values of \(\lambda\), choosing the \(\lambda\) that minimized cross-validation error for each model. When we ran the regression with a 70-30 validation split of the training set, we got an accuracy on the Kaggle test set of 0.5570.

When we reran the ridge regression with an 85-15 split of the training data, we got a Kaggle accuracy of 0.5956 (1,414th place). Since the test set is so much larger than our training set, it was beneficial to use a larger proportion of the training data for learning before running the model on the much larger test set.

Lasso Regularization

A 10-fold cross-validated lasso-regularized logistic regression earned a Kaggle score of 0.59594; a 70-30 split scored 0.59526, and an 85-15 split scored 0.59443 when we took the lowest 50th percentile of lambdas and predicted with the mode of their predictions.
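The mode-of-predictions step amounts to a majority vote across the per-lambda models. A minimal sketch (Python for illustration, with hypothetical predictions):

```python
from collections import Counter

def majority_vote(per_model_predictions):
    """Column-wise mode: each inner list holds one model's predicted cover types."""
    votes_per_obs = zip(*per_model_predictions)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_obs]

# Hypothetical predictions from three lambda values for five observations.
preds = [
    [1, 2, 2, 7, 5],
    [1, 2, 3, 7, 5],
    [1, 3, 2, 6, 5],
]
print(majority_vote(preds))  # [1, 2, 2, 7, 5]
```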

Elastic-net Regularization

Using both the ridge and lasso regularization terms concurrently produces an elastic-net regularization. We ran this hybrid method with both a 70-30 split and 85-15 split in the training set. When running these models on the testing set, we got accuracies on Kaggle of 0.5959 and 0.5952, respectively.

We also tried an elastic-net regression using all 15,120 observations in the training set with \(\alpha = 0.5\). It earned an accuracy score on the testing set of 0.5950.

Artificial Neural Networks

We first modeled the data in a feed-forward neural network with a single layer of nodes. Later, we created deep learning models where we added more layers of hidden nodes.

Single Hidden Layer Using the nnet Package

The nnet R library allows the construction of feed-forward single-hidden-layer neural networks. After scaling our variables and doing a basic run, we decided to tune the nnet function based on the following parameters:

  - Number of nodes
  - Number of iterations
  - Maximum number of weights (since this depends on the number of nodes)

We experimented to find the optimal number of nodes for our hidden layer by checking our accuracy levels over various numbers of nodes.

We finally chose a 54-100-7 neural network for our training data, achieving 0.52113 accuracy with 300 maximum iterations and 0.52432 accuracy with 500 maximum iterations.
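To illustrate what a 54-100-7 network looks like, the following sketch (Python/NumPy with random, untrained weights; the actual models were fit with nnet in R) runs one forward pass and counts the weights that nnet's maximum-weights bound must accommodate:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 54, 100, 7   # 54 predictors, 100 hidden nodes, 7 cover types

# Random (untrained) weights, including a bias unit for each layer.
W1 = rng.normal(size=(n_in + 1, n_hidden))
W2 = rng.normal(size=(n_hidden + 1, n_out))

def forward(X):
    """One forward pass: sigmoid hidden layer, softmax output layer."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # add bias column
    H = 1.0 / (1.0 + np.exp(-X1 @ W1))              # sigmoid activations
    H1 = np.hstack([np.ones((H.shape[0], 1)), H])
    Z = H1 @ W2
    Z -= Z.max(axis=1, keepdims=True)               # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)         # class probabilities

n_weights = W1.size + W2.size
print(n_weights)    # 6207, so the maximum-weights parameter must be at least this
probs = forward(rng.normal(size=(5, n_in)))
print(probs.shape)  # (5, 7); each row sums to 1
```

The weight count grows as (54 + 1) * nodes + (nodes + 1) * 7, which is why the maximum-weights parameter had to be raised alongside the node count.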

Deep Learning using H2o

For adding multiple hidden layers, we used the deeplearning function from the H2o package. H2o is an R package that runs parallel, distributed machine learning algorithms to improve computational speed. Later, we also used H2o’s Gradient Boosting Machine (GBM) algorithm to achieve our second-best accuracy rate.

Some features of H2o include:

  - In-memory compression techniques to handle large datasets
  - Automatic standardization
  - Randomized grid search
  - Fast algorithms deployed in Java
  - Anomaly detection to identify outliers

When we first ran a basic model, our validation results were poor, so we experimented with varying the default parameters using a grid search. We tried variations of one to three hidden layers with between 30 and 250 nodes each. We also varied the input dropout ratio, the fraction of input features omitted during training to improve generalizability; the learning rate, which controls how much the weights adjust at each step; and the rate annealing parameter, which decays the learning rate over the course of training to stabilize optimization. H2o allows for “hyper-parameter” tuning in the following format:

hyper_params <- list(
   hidden=list(c(30),c(50),c(70),c(100),c(130),c(160),c(200),c(250),c(30,60),c(60,90),c(90,120),c(120,150),c(30,60,90),c(60,90,120),c(90,120,160)), 
   input_dropout_ratio=c(0,0.05),
   rate=c(0.01,0.02),
   rate_annealing=c(1e-8,1e-7,1e-6))

From our first tuning run, the best model was a 54-120-150-7 network with an accuracy of 0.60186.

After returning to the model, we decided to experiment with a wider range of the tuning parameters. Our best model was a 54-500-800-7 network with a learning rate of 0.02, rate annealing of 1e-05, input dropout ratio of 0, and an accuracy of 0.674.

Notable Studies from the Past

The earliest use of the dataset was in a doctoral dissertation: Jock Blackard and Denis Dean at Colorado State University used linear discriminant analysis to obtain 58% accuracy and an artificial neural network to achieve 70% accuracy. The paper concludes that although ANNs have their drawbacks, they are still more viable than traditional methods, by which we think the authors refer to logistic regression, LDA, and support vector classifiers.
In 2007, Jain & Minz (Journal of Indian Agricultural Statistics) applied rough set theory as an alternative to machine learning approaches, achieving 78% accuracy on the forest cover dataset while using fewer attributes (only 29). This is especially relevant for real-world remote-sensing classification applications involving large datasets.
In 2004, in a publication of the IEEE International Conference on Systems, Man and Cybernetics, Xiaomei Liu and Kevin W. Bowyer of the University of Notre Dame used the forest cover dataset to make an interesting conjecture. In their own words:
“Based on some experiments with the Forest Cover Type Data Set, we conjectured a class of problems that are extremely difficult for FFBP NNs but relatively simple for DTs. The features of such problems are that some minority classes are surrounded and overlapped by some majority classes. We generated synthetic data sets with the above description. In our experiments on the synthetic data sets, we varied the size and the number of the islands for the minority class. We also varied the ratio of the minority class and the majority class for each island. We tried a plain feed forward back propagation NN, a Rprop NN, and a cascade correlation NN. With all of these experiments, the DT performed better than each kind of FFBP NN”
Based on these findings, we decided to move on from neural networks in favor of tree-based methods.

Tree Based Methods

Random Forest

Boosting Using XGBoost

Boosting Using H2o’s Gradient Boosting Algorithm

Extremely Randomized Trees

Support Vector Machines

Ensembling

Results Summary

Model | Kaggle Accuracy
----- | ---------------
Logistic Regression (Ridge) | 0.5956
Logistic Regression (Lasso) | 0.5959
Logistic Regression (Elastic-net) | 0.5959
Single-hidden-layer Neural Network (nnet) | 0.5243
Deep Learning (H2o) | 0.6740

Conclusion

Our experience echoes Liu and Bowyer’s findings: “The general characteristics of this class of problems is that some minority classes are distributed in small regions that are surrounded and overlapped with majority classes.” “When there is a minority class that is distributed in small isolated regions of feature space, the BP learning mechanism seems to have great trouble ‘finding’ the regions.”

References:

  - Jain & Minz, Journal of Indian Agricultural Statistics: http://www.isas.org.in/jsp/volume/vol62/Apr-08/Apr-08-9.pdf
  - Blackard & Dean, “Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables”: https://www.researchgate.net/publication/222499246_Comparative_Accuracies_of_Artificial_Neural_Networks_and_Discriminant_Analysis_in_Predicting_Forest_Cover_Types_from_Cartographic_Variables
  - Liu & Bowyer, IEEE Systems, Man and Cybernetics: http://www.csee.usf.edu/~hall/papers/smc04nn.pdf